Received: by cheltenham.cs.arizona.edu; Wed, 4 May 1994 12:28:50 MST
From: "Art Eschenlauer" <eschen@molbio.cbs.umn.edu>
Message-Id: <9405041847.AA05442@molbio.cbs.umn.edu>
Subject: pattern searching for molecular biology
To: icon-group@cs.arizona.edu
Date: Wed, 4 May 94 13:47:52 CDT
X-Mailer: ELM [version 2.3test PL26]
Status: R
Errors-To: icon-group-errors@cs.arizona.edu
I have a molecular biology problem that I am trying to solve with Icon.
It boils down to pattern searching. I suppose I should look up pattern
matching algorithms in a real comp sci book, but if you wish to read
further....
Genes consist of strings of the bases A, C, G, and T. I want to look for
patterns in those strings (patterns recognized by enzymes that cut the DNA
at specific sequences). So, I have a file of about 200 different
recognition sequences; e.g., EcoRI is an enzyme that recognizes and cuts
at GAATTC sequences. Additionally, some of those recognition sequences are
redundant, i.e., they contain codes that stand for more than one base; for
example, StyI recognizes CCWWGG, where W = A|T, and MslI recognizes
CAYNNNNRTG, where Y = C|T, R = A|G, and N = A|C|G|T. Many recognition
sequences are about 6 bases long, but they range from 4 to 15 bases.
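To make the notation concrete, here is a minimal sketch that expands a redundant site into the concrete strings it stands for. The procedure names bases and expand are hypothetical, and only the codes defined above (W, Y, R, N) are handled; any other letter is assumed to stand for itself.

```icon
# Hypothetical helper: the bases that one code letter stands for.
# Only the codes named in this message are covered.
procedure bases(code)
   return case code of {
      "W": "AT"
      "Y": "CT"
      "R": "AG"
      "N": "ACGT"
      default: code        # A, C, G, T stand for themselves
   }
end

# Generate every concrete sequence that a site describes.
procedure expand(site)
   if *site = 0 then return ""
   suspend !bases(site[1]) || expand(site[2:0])
end

procedure main()
   # writes CCAAGG, CCATGG, CCTAGG, CCTTGG, one per line
   every write(expand("CCWWGG"))
end
```

Enumerating like this is only practical for short sites with few redundant positions; it is shown here to illustrate what the codes mean, not as a search method.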
I want a way to match these sequences that is efficient with memory and
time. What is the FASTEST way to do this in Icon? The FIRST way that I
can think of doing it is with string invocation, e.g.,
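# &subject holds the gene sequence; recognition sites are located with
#    every site := SiteGeneratorFunction() do
#       every i := 1 to *&subject do
#          if tab(i) & (site[1:2])( site[2:0] ) then { whatever }
# The [1:2] substring is taken to produce the 1 char name of the
# procedure to be invoked (string invocation may need an
# "invocable all" declaration).
# ... declare procedures A,B,C,D,G,H,K,M,N,R,S,T, and W

procedure Y(RestOfSite)
   # match one base that is C or T at the current position, then hand
   # the rest of the site to the procedure named by its first character
   suspend ( move(1) == ("C"|"T") ) ||
      ( if *RestOfSite = 0 then ""
        else (RestOfSite[1:2])( RestOfSite[2:0] ) )
end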
(Please pardon [and point out] any gross errors ... I do not yet
have a perfect understanding of string scanning and generators.)
Will that be fast? Is there a faster algorithm that is obvious or some
feature of Icon that I am missing? I hesitate to use coexpressions, since,
for example, CAYNNNNRTG would produce 2^2 * 4^4 = 1024 results!
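
One possible way around the 1024-result blowup, offered as a sketch rather than a measured answer: turn each site into a list of csets, one per position, and test the sequence against those csets instead of generating every concrete string. The procedure names sitesets and sites are hypothetical, the test sequence is made up, and only the codes defined above are handled (an unknown code would cause a run-time error). Whether this is actually the fastest approach in Icon would need timing on real data.

```icon
# Hypothetical helper: turn a site such as "CAYNNNNRTG" into a list of
# csets, one per position. Only codes named in this message are known.
procedure sitesets(site)
   local codes, L, c
   codes := table()
   every c := !"ACGT" do codes[c] := cset(c)
   codes["W"] := 'AT'; codes["Y"] := 'CT'
   codes["R"] := 'AG'; codes["N"] := 'ACGT'
   L := []
   every put(L, codes[!site])
   return L
end

# Generate every position in s at which the compiled site L matches.
procedure sites(s, L)
   local i, k
   every i := upto(L[1], s) do {      # jump to candidate start positions
      if i + *L - 1 > *s then fail    # not enough sequence left
      k := 1
      while k < *L & any(L[k + 1], s, i + k) do k +:= 1
      if k = *L then suspend i
   }
end

procedure main()
   local L, n
   L := sitesets("CAYNNNNRTG")
   n := 1
   every n *:= *!L          # number of concrete strings the site covers
   write(n)                 # 1024
   every write(sites("GGCACAAAAGTGCC", L))   # 3
end
```

With this layout, a 10-base site costs at most 10 cset tests per candidate position, however many concrete strings it stands for.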